I am feeling good about the course at the moment. In the beginning, I encountered some errors when connecting RStudio and GitHub. I was a bit frustrated at that time, but after I managed to solve them, I felt encouraged.
I have used RStudio a couple of times before, without R Markdown. In this course, I expect to learn R Markdown and GitHub. Both of them are powerful tools, and I believe that I will use them in my research in the future.
I found this course on Sisu and was attracted by its name, 'Open Data Science'.
I find this book clear and well-organized. I would recommend it to everyone who is starting to learn R.
https://github.com/keyanliu1/IODS-project
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Tue Dec 6 03:13:55 2022"
This work includes performing and interpreting regression analysis. Specifically, it includes plotting the data frame, fitting a linear regression model, interpreting the results, and producing diagnostic plots.
The data comes from the 'International Survey of Approaches to Learning', made possible by Teachers' Academy funding for Kimmo Vehkalahti in 2013-2015. After some data wrangling, the final dataset I am working with contains 166 observations and 7 variables: gender, age, attitude, deep, stra, surf and points.
date()
## [1] "Tue Dec 6 03:13:55 2022"
#library
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ purrr 0.3.5
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# read the data into memory
lrn14 <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep=",", header=TRUE)
#check the dimension and structure of the data
dim(lrn14)
## [1] 166 7
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(lrn14[-1])
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(lrn14, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
# add transparency to the points and draw the plot
p2 <- ggpairs(lrn14, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p2
#summary of the data
summary(lrn14)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
Part 2.2 shows a graphical overview of the data. The distributions of the variables can be found on the diagonal of the plot matrix, and the relationships among the variables on the off-diagonal. The console output above summarizes the variables.
# create a regression model with multiple explanatory variables
my_model <- lm(points ~ deep + stra + surf, data = lrn14)
#summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ deep + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1235 -3.0737 0.5226 4.2799 10.3229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.9143 5.1169 5.260 4.5e-07 ***
## deep -0.7443 0.8662 -0.859 0.3915
## stra 0.9878 0.5962 1.657 0.0994 .
## surf -1.6296 0.9153 -1.780 0.0769 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.827 on 162 degrees of freedom
## Multiple R-squared: 0.04071, Adjusted R-squared: 0.02295
## F-statistic: 2.292 on 3 and 162 DF, p-value: 0.08016
From the summary of the model, I find that stra has a positive relationship with points (0.99), whereas the surf variable has a negative relationship (-1.63). The effect of deep on points is not significant because its p-value is greater than 0.1 (the p-value is the result of a null hypothesis significance test of the true parameter being equal to zero). Therefore, I remove the deep variable, include attitude, and rerun the model.
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra + surf, data = lrn14)
#summary of the model
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
From the new model, I find that attitude has a significant positive effect on points. However, the effect of surf is not significant, so I remove surf and rerun the model.
# create a regression model with multiple explanatory variables
my_model3 <- lm(points ~ attitude + stra , data = lrn14)
#summary of the model
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude + stra, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
From the summary of the model above, the estimated coefficient of attitude is 3.4658. The interpretation is that, holding the other variables unchanged, a one-unit increase in attitude results in a 3.47-unit increase in points. This positive effect is significant, as the p-value of the coefficient is very low. Using the same interpretation, the estimated coefficient of stra is 0.91: holding the other variables unchanged, a one-unit increase in stra results in a 0.91-unit increase in points. However, this effect is significant only at the 0.1 level, as its p-value is smaller than 0.1 but greater than 0.05. The adjusted R-squared is 0.195, meaning that about 19.5% of the variability in the dependent variable is explained by the explanatory variables.
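To illustrate this interpretation, the fitted model can be used to predict points for two hypothetical students who differ by one unit of attitude (the values below are made up for illustration only):

```r
# two hypothetical students: identical stra, attitude differs by one unit
new_students <- data.frame(attitude = c(3.0, 4.0), stra = c(3.0, 3.0))
# predicted exam points from the fitted model my_model3
predict(my_model3, newdata = new_students)
# the difference between the two predictions equals the attitude coefficient (about 3.47)
```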
# draw diagnostic plots using the plot() function:
# Residuals vs Fitted (1), Normal Q-Q (2) and Residuals vs Leverage (5)
par(mfrow = c(2,2))
plot(my_model3, which = c(1, 2, 5))
Linear regression models have four main assumptions:
- Linear relationship between predictors and outcome;
- Independence of residuals;
- Normal distribution of residuals;
- Equal variance of residuals.
From the diagnostic plots, I am able to observe properties of the residuals. In the Q-Q plot, the residuals match the normal distribution quite well. In the Residuals vs Fitted plot, the residuals are randomly scattered on both sides of the fitted values, with roughly equal spread. In this way, I assess the validity of the assumptions based on the diagnostic plots.
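Beyond visual inspection, the normality assumption can also be checked numerically; the Shapiro-Wilk test below is my own addition, not part of the exercise:

```r
# Shapiro-Wilk test of the normality of the residuals of my_model3
# (a large p-value means normality is not rejected)
shapiro.test(residuals(my_model3))
```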
This work includes performing and interpreting logistic regression analysis. Specifically, it includes plotting the data frame, fitting a logistic regression model, and interpreting the results.
This data concerns student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and the data was collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). The two datasets were modeled under binary/five-level classification and regression tasks.
date()
## [1] "Tue Dec 6 03:14:01 2022"
#library
library(GGally)
library(ggplot2)
library(dplyr)
library(tidyverse)
# read the data into memory
library(readr)
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)
The purpose of my analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. In this exercise, I choose the famrel, studytime, age and freetime variables and study their relationship with alcohol consumption. I assume that famrel and studytime have a negative impact on alcohol consumption, while age and freetime have a positive impact on it.
This subchapter uses bar plots and box plots to explore the distributions of my chosen variables and their relationships with alcohol consumption.
#distribution plots using bar plot
# initialize a plot of alcohol use
g_1 <- ggplot(data = alc, aes(x = alc_use))
g_1 + geom_bar()
g_2 <- ggplot(data = alc, aes(x = high_use))
g_2 + geom_bar()
g_3 <- ggplot(data = alc, aes(x = famrel))
g_3 + geom_bar()
g_4 <- ggplot(data = alc, aes(x = studytime))
g_4 + geom_bar()
g_5 <- ggplot(data = alc, aes(x = age))
g_5 + geom_bar()
g_6 <- ggplot(data = alc, aes(x = freetime))
g_6 + geom_bar()
# initialize a plot of high_use and famrel
g1 <- ggplot(alc, aes(x = high_use, y = famrel))
# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("family relationship")
# studytime and alchohol consumption
g2 <- ggplot(alc, aes(x = high_use, y = studytime))
g2 + geom_boxplot() + ylab("study time")
# age and alchohol consumption
g3 <- ggplot(alc, aes(x = high_use, y = age))
g3 + geom_boxplot() + ylab("age")
# freetime and alchohol consumption
g4 <- ggplot(alc, aes(x = high_use, y = freetime))
g4 + geom_boxplot() + ylab("freetime")
Part 3.3 shows a graphical overview of the variables. The distributions can be seen in the bar plots: the distributions of alc_use, high_use, studytime and age are right-skewed (long tail toward high values), while the distributions of famrel and freetime are left-skewed. From the box plots, it seems that famrel and studytime are negatively correlated with alcohol use, whereas age and freetime seem to have little visible effect on it.
This part runs the logistic regression with the variables selected above.
# find the model with glm()
m <- glm(high_use ~ famrel + studytime + age + freetime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + studytime + age + freetime,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7542 -0.8553 -0.6349 1.1722 2.2035
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.2910 1.8099 -1.818 0.069019 .
## famrel -0.3542 0.1315 -2.693 0.007086 **
## studytime -0.5725 0.1576 -3.632 0.000281 ***
## age 0.2221 0.1023 2.170 0.029983 *
## freetime 0.3793 0.1264 3.000 0.002701 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 415.07 on 365 degrees of freedom
## AIC: 425.07
##
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp
CI <- confint(m)
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.03721707 -6.88776678 0.23054128
## famrel 0.70173583 -0.61460703 -0.09722191
## studytime 0.56414042 -0.89131071 -0.27196009
## age 1.24866753 0.02362678 0.42612657
## freetime 1.46128242 0.13517480 0.63202911
From the table, I observe that the estimated coefficients of famrel and studytime are -0.35 and -0.57 respectively, and both are statistically significant judging by their p-values. The results indicate that famrel and studytime have a negative relationship with alcohol consumption, which supports my earlier assumption. On the other hand, the estimated coefficients of age and freetime are positive (0.22 and 0.38, respectively). The p-value of the age coefficient is 0.03, which means the coefficient is significant at the 0.05 level but not at the 0.01 level. In addition, the odds ratios, which are the exponentials of the estimated coefficients, are given in the table together with the confidence intervals (note that the confidence intervals printed above are still on the log-odds scale; to compare them directly with the odds ratios, they would also need to be exponentiated). The exponential of a coefficient in a logistic regression model can be interpreted as the odds ratio between a unit change (vs. no change) in the corresponding explanatory variable.
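To make the odds-ratio interpretation concrete, here is a small sketch (the starting odds of 0.5 are made up for illustration; `m` is the fitted model above):

```r
# odds ratio for freetime: about 1.46 according to the table above
or_freetime <- exp(coef(m)["freetime"])
# if a student's odds of high use were 0.5 (probability 1/3),
# one extra unit of freetime multiplies the odds by the odds ratio
new_odds <- 0.5 * or_freetime
# convert the new odds back to a probability
new_prob <- new_odds / (1 + new_odds)
new_prob
```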
## prediction
## high_use FALSE TRUE
## FALSE 238 21
## TRUE 92 19
This part shows the predictive power of the model. A 2x2 cross tabulation of actual versus predicted values is provided. The overall predictive performance is reasonable, but one type of error stands out: when high_use is TRUE, the model predicts it wrongly most of the time (92 of 111 cases). The training error is 0.305, which is quite high, so a better model may be needed. However, this model is still better than a simple guessing strategy.
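The cross tabulation and training error above can be reproduced along these lines (a sketch assuming the fitted model `m`; the column names `probability` and `prediction` are my own choices):

```r
# predicted probabilities of high alcohol use from the fitted model m
alc <- mutate(alc, probability = predict(m, type = "response"))
# classify as high use when the predicted probability exceeds 0.5
alc <- mutate(alc, prediction = probability > 0.5)
# 2x2 cross tabulation of the actual versus the predicted values
table(high_use = alc$high_use, prediction = alc$prediction)
# training error: the share of incorrect predictions
mean(alc$high_use != alc$prediction)
```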
In this part, a cross-validation of my model is performed. Note that the call below uses K = nrow(alc), i.e. leave-one-out cross-validation rather than 10-fold.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.3189189
The average proportion of wrong predictions in the cross-validation is 0.32, which is larger than 0.26, the error of the model in Exercise 3. Unfortunately, my model does not have better test-set performance than that model.
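For comparison, a 10-fold cross-validation (instead of leave-one-out) can be run like this (a sketch; the exact error will vary with the random folds, so the seed is fixed for reproducibility):

```r
library(boot)
set.seed(123)  # folds are assigned randomly, so fix the seed
cv10 <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average proportion of wrong predictions over the 10 folds
cv10$delta[1]
```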
This work includes performing and interpreting clustering and classification. Specifically, it includes plotting the data frame, fitting linear discriminant analysis, making predictions, and K-means clustering.
Load the Boston data from the MASS package. The data is titled 'Housing Values in Suburbs of Boston' and contains 506 rows and 14 columns.
date()
## [1] "Tue Dec 6 03:14:03 2022"
#library
library(GGally)
library(ggplot2)
library(dplyr)
library(tidyverse)
# read the data into memory
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
This subchapter uses scatter plot matrices and a correlation plot to explore the distributions of the Boston variables and the relationships between them.
# draw a scatter plot matrix of the first seven variables in the Boston data
pairs(Boston[1:7])
# draw the plot
p2 <- ggpairs(Boston[1:7], mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p2
p3 <- ggpairs(Boston[8:14], mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p3
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## black lstat medv
## crim -0.38506394 0.4556215 -0.3883046
## zn 0.17552032 -0.4129946 0.3604453
## indus -0.35697654 0.6037997 -0.4837252
## chas 0.04878848 -0.0539293 0.1752602
## nox -0.38005064 0.5908789 -0.4273208
## rm 0.12806864 -0.6138083 0.6953599
## age -0.27353398 0.6023385 -0.3769546
## dis 0.29151167 -0.4969958 0.2499287
## rad -0.44441282 0.4886763 -0.3816262
## tax -0.44180801 0.5439934 -0.4685359
## ptratio -0.17738330 0.3740443 -0.5077867
## black 1.00000000 -0.3660869 0.3334608
## lstat -0.36608690 1.0000000 -0.7376627
## medv 0.33346082 -0.7376627 1.0000000
# visualize the correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle")
Part 4.2 shows a graphical overview of the variables. The distributions of the variables can be seen in the second and third plots. The last plot is a correlation plot, which clearly shows the correlations among the variables.
This part makes some modifications to the data. Specifically, I scale the data, create a categorical variable of the crime rate from the scaled crime rate (using the quantiles as the break points), drop the old crime rate variable from the dataset, and divide the dataset into train and test sets so that 80% of the data belongs to the train set.
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crim <- as.numeric(boston_scaled$crim)
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low","med_low","med_high", "high"), include.lowest = TRUE)
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
n
## [1] 506
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
In this part, a linear discriminant analysis is performed on the train dataset. The categorical crime rate is used as the target variable and all the other variables in the dataset as predictors. This part also draws the LDA (bi)plot.
# linear discriminant analysis
lda.fit <- lda(crime~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2425743 0.2524752 0.2301980 0.2747525
##
## Group means:
## zn indus chas nox rm age
## low 0.8920473 -0.9035864 -0.11163110 -0.8725260 0.38336014 -0.9105165
## med_low -0.1061801 -0.2404287 0.03646311 -0.5713425 -0.16873479 -0.3798224
## med_high -0.4309928 0.1772215 0.23568387 0.3646188 0.02558793 0.4216577
## high -0.4872402 1.0169382 -0.05951284 1.0646400 -0.40860638 0.8199605
## dis rad tax ptratio black lstat
## low 0.8552000 -0.6806916 -0.7484115 -0.43099681 0.3729220 -0.76050575
## med_low 0.3548719 -0.5551369 -0.4885286 -0.04919939 0.3066748 -0.10362941
## med_high -0.3678746 -0.4619738 -0.3346990 -0.21289678 0.1488695 0.03074016
## high -0.8578063 1.6399445 1.5153545 0.78289123 -0.7709400 0.90004786
## medv
## low 0.50746340
## med_low -0.01422682
## med_high 0.16069693
## high -0.67779022
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11217067 0.65648115 -0.884220031
## indus 0.14813560 -0.12355216 0.644754162
## chas -0.03235825 -0.09049877 0.081240356
## nox 0.29538541 -0.81516959 -1.381920873
## rm 0.01443438 -0.08695307 -0.082203058
## age 0.20779848 -0.42609727 -0.221064089
## dis -0.04471161 -0.25878751 0.292077628
## rad 4.09324864 1.12913006 -0.007073554
## tax 0.02247858 -0.17825783 0.364657182
## ptratio 0.12751763 -0.14344786 -0.223837068
## black -0.05762458 0.03621903 0.080461503
## lstat 0.13811501 -0.15386206 0.472327880
## medv 0.03024809 -0.35803077 -0.168071340
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9669 0.0248 0.0083
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results on the first two discriminants
plot(lda.fit, dimen = 2, col = classes)
lda.arrows(lda.fit, myscale = 3)
In this chapter, I predict the crime classes with the test data.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes , predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 20 6 3 0
## med_low 5 14 5 0
## med_high 3 7 19 4
## high 0 0 0 16
The cross tabulation indicates quite good prediction results based on the LDA model. One weak spot is the med_high class: of the 33 observations whose correct class is med_high, 14 are misclassified, 7 of them as med_low.
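The overall accuracy can be computed from the same objects (a small sketch; the exact value depends on the random train/test split):

```r
# proportion of correctly classified observations in the test set
mean(correct_classes == lda.pred$class)
```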
This part calculates the distances between the observations, runs the k-means algorithm on the dataset, and investigates the optimal number of clusters.
# scale the data so that all variables are comparable
Boston <- as.data.frame(scale(Boston))
# euclidean distance matrix
dist_eu <- dist(Boston)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.5760 4.9401 4.9913 6.3033 12.8856
# manhattan distance matrix
dist_man <- dist(Boston, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2645 8.9648 13.2765 14.1297 18.5263 46.5452
set.seed(13)
# k-means clustering
km <- kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)
The optimal number of clusters is 2. This is determined by observing that the total WCSS drops sharply when k is 2. In the pairs plot, I notice that the tax variable seems to affect the clustering results the most.
As a final (bonus) step, I project the train data onto the LDA discriminants and draw interactive 3D scatter plots, colored by the crime classes and by the k-means clusters.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers',color = ~train$crime)
# align the k-means cluster labels with the sampled train rows (km was fitted on all 506 rows)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = ~km$cluster[ind])
This work includes performing and interpreting dimensionality reduction techniques. Specifically, it includes plotting the data frame, performing principal component analysis (PCA), and multiple correspondence analysis (MCA).
Load the human data from the website. The data is titled 'human' and contains 155 observations and 8 variables.
date()
## [1] "Tue Dec 6 03:14:14 2022"
#library
library(GGally)
library(ggplot2)
library(dplyr)
library(tidyverse)
# read the data into memory
human <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt",
sep =",", header = T)
# explore the dataset
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# remove the Country variable
#human_ <- select(human, -Country)
# Access GGally
library(GGally)
# visualize the 'human_' variables
ggpairs(human[1:4])
ggpairs(human[5:8])
# Access corrplot
library(corrplot)
# compute the correlation matrix and visualize it with corrplot
corr<-cor(human)
corrplot(corr)
Part 5.1 shows a graphical overview of the variables. The distributions of the variables can be seen in the two ggpairs plots, from which we conclude that almost all variables are skewed. The last plot is a correlation plot, which clearly shows the correlations among the variables.
This part performs principal component analysis (PCA) on the raw (non-standardized) human data.
# perform PCA on the raw (non-standardized) data
pca_human <- prcomp(human)
# create and print out a summary of pca_human
s <- summary(pca_human)
biplot(pca_human, choices = 1:2,cex = c(1, 1),col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
From the PCA result on the raw data, the first principal component captures essentially all (100%, after rounding) of the variability, while the remaining components capture practically none.
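This dominance of one large-scale variable is a general property of unscaled PCA, not something specific to the human data. A minimal sketch on the built-in USArrests data (where Assault plays the role GNI plays here):

```r
# PCA on raw vs. standardized data (illustrative, using built-in USArrests):
# Assault has a much larger variance than the other columns, so it dominates
# the unscaled decomposition
pca_raw    <- prcomp(USArrests)                 # no standardization
pca_scaled <- prcomp(USArrests, scale. = TRUE)  # standardize first

# proportion of variance captured by PC1 in each case
summary(pca_raw)$importance[2, 1]     # close to 1: Assault dominates
summary(pca_scaled)$importance[2, 1]  # a much more balanced decomposition
```

The same mechanism explains why GNI alone drives PC1 in the raw human data.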
This part performs principal component analysis (PCA) on the standardized human data.
# standardize the variables
human_ <- scale(human)
# perform principal component analysis on the standardized data
pca_human_ <- prcomp(human_)
# create a summary of pca_human_
s <- summary(pca_human_)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
pc_lab
## [1] "PC1 (53.6%)" "PC2 (16.2%)" "PC3 (9.6%)" "PC4 (7.6%)" "PC5 (5.5%)"
## [6] "PC6 (3.6%)" "PC7 (2.6%)" "PC8 (1.3%)"
# draw a biplot
biplot(pca_human_, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
The PCA on the standardized data is completely different from the PCA on the raw data. In part 5.2, where I performed PCA on the raw data, the first PC accounted for 100 percent of the variability, and the variable GNI alone accounted for almost all of that first component. This is because the values of GNI are on a much larger scale than those of the other variables. After standardization, the results become sensible: the first component captures 53.6 percent of the total variability and the second captures 16.2 percent. From the biplot I can also read off the correlations between the variables, and between the variables and the PCs, by noticing that:
- The angle between two arrows can be interpreted as the correlation between the variables.
- The angle between a variable and a PC axis can be interpreted as the correlation between the two.
- The lengths of the arrows are proportional to the standard deviations of the variables.
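These angle interpretations can be verified numerically, since for standardized variables the correlation between a variable and a PC is exactly the loading times the PC's standard deviation. A sketch, again on the built-in USArrests data rather than the human data:

```r
# correlations between standardized variables and PC scores (illustrative,
# built-in USArrests): a small angle between an arrow and a PC axis in the
# biplot corresponds to a correlation near +/-1 here
usa_std <- scale(USArrests)
pca <- prcomp(usa_std)

# rows: original variables, columns: principal components
var_pc_cor <- cor(usa_std, pca$x)
round(var_pc_cor, 2)

# each variable's squared correlations over all PCs sum to 1,
# since the PCs together capture all of its (unit) variance
rowSums(var_pc_cor^2)
```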
In this part, I interpret the PCs. From the biplot and the interpretation method described above, the variables Labor.FM and Parli.F mainly contribute to PC2, while the other variables mainly contribute to PC1.
In this part, I explore the tea data and perform Multiple Correspondence Analysis (MCA) on it. The tea data comes from the FactoMineR package and was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and how they perceive the products (12 questions).
# load the libraries and the tea data
library(dplyr)
library(tidyr)
library(ggplot2)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, all_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
library(ggplot2)
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free")+ geom_bar()+ theme(axis.text.x = element_text(angle=20, hjust = 1, size = 8))
# multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic",habillage = "quali")
From the MCA factor map, I can clearly observe how the categories relate to each dimension. For example, the category green mainly contributes to dimension 2, while tea bag mostly contributes to dimension 1.
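What the factor map shows visually can be quantified with the contribution tables FactoMineR returns. A sketch, assuming FactoMineR is installed (it is loaded above) and using the copy of the tea data shipped with the package, which contains the same survey as the CSV read earlier:

```r
# category contributions (in %) to the first two MCA dimensions
# (sketch using the tea data bundled with FactoMineR)
library(FactoMineR)
data(tea)
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
mca <- MCA(tea_time, graph = FALSE)

# rows: categories, columns: dimensions; within a dimension
# the contributions sum to 100
round(mca$var$contrib[, 1:2], 1)
```

Reading down a column identifies the categories that drive that dimension, which is a more precise version of eyeballing distances on the factor map.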
(more chapters to be added similarly as we proceed with the course!)